On the optimal contact potential of proteins 
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We analytically derive the lower bound of the total conformational energy of a protein structure 
by assuming that the total conformational energy is well approximated by the sum of sequence- 
dependent pairwise contact energies. The condition for the native structure achieving the lower 
bound leads to the contact energy matrix that is a scalar multiple of the native contact matrix, i.e., 
the so-called Go potential. We also derive spectral relations between contact matrix and energy 
matrix, and approximations related to one-dimensional protein structures. Implications for protein 
structure prediction are discussed. 
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Proteins' biological functions are made possible by 
their precise three-dimensional (3D) structures, and each 
3D structure is determined by its amino acid sequence 
through the laws of thermodynamics Therefore, 
predicting protein structures from their amino acid se- 
quences is important not only for inferring proteins' bi- 
ological functions, but also for understanding how 3D 
structures are encoded in such one-dimensional informa- 
tion as amino acid sequence. The problem of protein 
structure prediction is naturally cast as an optimization 
problem where a potential function is minimized. Given 
an appropriate potential function, conformational opti- 
mization should yield the native structure as the unique 
global minimum conformation of the potential function. 
Thus, the problem has been traditionally divided into two 
sub-problems: One is to establish an appropriate poten- 
tial function and the other is to develop the meth- 
ods to efficiently search the vast conformational space of 
a protein [3j. Among various forms of effective energy 
functions, statistical contact potentials [J, l5[ have been 
widely used. In this Letter, we exclusively treat a class 
of such contact potentials, neglecting other contributions 
such as electrostatics and local interactions. Accordingly, 
a protein conformation is represented as a contact ma- 
trix in which the element is 1 if the residues i and 
j are in contact in space, otherwise it is 0. Although 
the contact matrix is a coarse-grained representation of 
protein conformation, it has been known that the con- 
tact matrix contains sufficient information to recover the 
three-dimensional (native) structure of proteins Q . It is 
noted that, for the lattice model of proteins [7|, these rep- 
resentations of protein conformation and energy function 
are exact. 



Lower bound of contact energy 

Our fundamental assumption is that the conforma- 
tional energy of a protein can be somehow expressed 
in terms of a contact matrix. Now let us assume that 
the total energy of a protein can be well approximated 
by the sum of pairwise contact energies between amino 
acid residues, and that each pairwise contact energy can 
be decomposed into a sequence-dependent term and a 
conformation-dependent term. The sequence-dependent 
term is expressed as a matrix £ (S) — (£ij) which we 
call the contact energy matrix, or E- matrix for short. 
Each element £ ij of the £?-matrix represents the energy 
between the residues i and j when they are in contact. 
This form of the £?-matrix is a very general one: Each 
element, £ij, may depend on the entire sequence, S, or it 
may depend only on the types of the interacting amino 
acid residues, i and j, as in the conventional contact po- 
tentials. The conformation-dependent term is expressed 
as another matrix A(C) = (Ay) which we call the con- 
tact matrix, or C-matrix. Each element Ay of the C- 
matrix assumes a value of cither 1 or 0, depending on 
the residues i and j are in contact or not, respectively. 
Hence the total energy E(C, S) of a protein of sequence 
S of N residues and having conformation C is given by 

N N 



E{C,S) = ~ £ A« (C) (1) 

(2) 



i=\ j=i 



~[£(S),A(C)] 



where [•, ■] denotes the Frobenius inner product between 
two matrices [1, 0]. Based on this assumption, we de- 
rive the lower bound for the conformational energy and 
the conditions for the native structure and i?-matrix to 



2 



achieve the bound. 

The Frobenius inner product leads to the matrix I2 
norm defined as, for a matrix M, \\M\\ = [M, M} 1/2 = 
(Sj j Mij) 1 ^ 2 - I n the case of C-matrix, since Ay = or 
1, we have 



||A(C)|| 2 = 27V c (C) 



(3) 



where N c = (1/2) • Ay is the total number of 
contacts. As for any inner products, the Frobenius 
inner product satisfies the Cauchy-Schwarz inequality 
(P,-B]| < PIHI-BII) from which we have 



[£,A]>-||£||||A|| 
where the equality holds if and only if 
£ = eA 



(4) 



(5) 



for some scalar e < 0. Although the inequality (Eq. [4]) 
holds for any pair of matrices, we now regard it as the 
lower bound for conformational energy for a given E- 
matrix. For simplicity, we first consider the energy min- 
imization problem for conformations with ||A(C)|| fixed 
to the value of the native conformation. It is desirable for 
the native conformation to satisfy the lower bound and 
hence its condition Eq. If the native conformation 

indeed satisfies the condition Eq. (|5|), then the elements 
of the E-matrix is either or e so that only the contacts 
present in the native conformation are stabilizing. Thus, 
the native conformation satisfying Eq. ([5]) is actually a 
GMEC among any conformations with arbitrary values of 
||A(C)||. An _E-matrix that satisfies Eq. for the native 
C-matrix is a kind of the so-called Go potential 1(| 11 1 
which has been essential for studying the protein folding 
problem. At this point, it is still possible that the na- 
tive structure is not the unique GMEC. For example, if 
a conformation contains all the native contacts together 
with some other contacts, this conformation has the same 
energy as the native conformation. In order for a native 
conformation to be the unique GMEC, it is required that 
the total number of contacts of the native conformation 
is larger than that of any other conformations that con- 
tain all the native contacts. From the relation Eq. ([3]), 
maximizing the total number of contacts is equivalent to 
maximizing the norm of the C-matrix, which in turn im- 
plies the minimization of the right-hand side of Eq. ([4]). 
To summarize, for a given _E-matrix, £(S), of a protein, 
its native conformation, C n , achieves the lower bound in 
Eq. (H|) if and only if £(S) = eA(C„) for some e < 0, and 
such native structure is the unique GMEC if and only if 
||A(C„)|| is the maximum of all possible conformations 
that contain all the native contacts. Note that the former 
condition is a relation between i?-matrix and C-matrix 
whereas the latter is a condition for a native structure to 
satisfy. The magnitude of s is not specified here, but it 
should be determined by other factors such as the folding 



temperature. It should be noted that a native structure 
can be the unique GMEC without achieving the lower 
bound of Eq. ((4|). Such a case is made possible either 
by the limitation of the conformational space imposed by 
other steric factors such as chain connectivity or excluded 
volumes, or by inherent inconsistencies of the iS-matrix 
so that no plausible conformations are allowed to satisfy 
the lower bounds. 



Spectral relations 

To examine more closely how the lower bound can be 
achieved, we next derive a more generous lower bound in 
a more restricted case. First, the C-matrix is decomposed 
as 



N 

A = J> 

a=l 



u„v„ 



(6) 



where a a is the a-th singular value and u Q and v a are 
the corresponding left and right singular vectors, respec- 
tively. U = (ui, • ■ ■ , Ujv) and V = (vi, • • ■ , vjv) are or- 
thogonal matrices. The singular components are sorted 
in decreasing order of the singular values: o\ > ■ ■ ■ > 
cjv(> 0). Since A is real symmetric, the singular values 
are the absolute values of the eigenvalues of A, and the 
singular vectors are such that u Q = ±v Q where the sign 
corresponds to that of the respective eigenvalue. Next, 
the i?-matrix is decomposed in the same manner as 



N 



(7) 



where r Q are singular values, and x a and y a are left and 
right singular vectors, respectively. Since £ is also real 
symmetric, the singular components have the same prop- 
erties as the C-matrix A. Noting that [£, A] = tr(£A T ), 
von Neumann's trace theorem [9(] leads to the following 
inequality: 



[£, A] > - a <*T a 



a = l 



where the equality holds if and only if 

( U Q X /3)(vJy/3) = -5 a £ 



(8) 



(9) 



for all a and (3 with non-zero singular values a a and 7]a 
(5 a< p is Kronecker's delta). We now regard this inequal- 
ity as a lower bound for the conformational energy for 
a given E-matrix. For a fixed set of the singular values 
a a (a = 1, • • • , N), if and only if there exists such a con- 
formation that satisfies the condition in Eq. Q, then 
that conformation is the lowest possible energy confor- 
mation. Let A Q and s a (a = 1, • • ■ , N) be the eigenval- 
ues of the C-matrix and E-matrix, respectively, sorted 
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in the decreasing order of their absolute values. Then 
ff a = |A Q | and r a — \e a \ for a — 1, ■ ■ ■ , N , and u Q and 
the eigenvectors of the corresponding matrices. 
Thus, in terms of eigenvalues and eigenvectors, the lower 
bound in Eq. (jHJ) is equal to J2 a ^a£a with X a e a < for 
a = 1, • • • , N. In addition to the condition Eq. © for 
the lower bound of Eq. J5J), if A and £ are of the same 
rank, then the numbers of positive, negative, and zero 
eigenvalues of A and —£ are the same and u Q = ±x Q . 
Thus, from Sylvester's law of inertia there exists a 
real non-singular matrix S such that 



e 



-SAS T , 



(10) 



i.e., the i?-matrix is *congruent to the C-matrix. If the 
conformation that satisfy the condition Eq. (JTUJ) is the 
native structure, the _E-matrix is consistent in the sense 
that the contributions from all the eigencomponents are 
stabilizing the native structure (A Q e Q < 0). Since the 
matrix S is non-singular, we can "predict" the native 
structure from the E matrix as A = —S~ 1 £S~ T (if we 
can construct the appropriate matrix S). At this point, 
however, the native structure may not be the GMEC 
since other conformations with a different set of singular 
values may have lower energies. 

In order to compare the energies of conformations with 
different sets of singular values, we use another inequal- 
ity @: 



>-ll£UIIA| 



(11) 



where the lower bound is the same as that in Eq. f4|. 
We note that, in terms of singular values, the matrix 
norms are expressed as ||A|| = (J2 a <J a) 1 ^ 2 an( l ll^ll = 
(J2 a T a) 1 ^ 2 - Hence, it is clear that the equality in Eq. 
(fTTj) holds if and only if, in addition to the condition 
in Eq. ©, there exists a scalar constant c such that 
r a = ca a for all a = 1, • • • , N. These conditions are 
equivalent to Eq. ([5]). 



One-dimensional approximations 

To connect the present results with previous studies, 
we next introduce two approximations. First, we con- 
sider the case where the i?-matrix is well approximated 
by its principal eigencomponcnt, that is, £ k EiXiX^ . 
This approximation is motivated by the eigenvalue anal- 
ysis of the Miyazawa-Jernigan (MJ) contact potential [H 
performed by Li et al. [12J, and has been employed by 



others [13, [14j, |15| . In this case, the lower bound Eq. 
© is achieved if and only if xi = ±m and EiAi < 0. 
This result was previously derived by Cao et al. [l3| who 
subsequently showed that the vector Xi constructed by 
using the components of the principal eigenvector of the 



MJ contact potential is indeed highly correlated with the 
principal eigenvector of the native contact matrices 14 1 . 
Bastolla et al. [H[ obtained a similar result, but they also 
showed that taking the average of such Xi over evolution- 
arily related proteins greatly improved the correlation. 
Since the rank of the contact matrix is in general not 1 , 
Eq. (fT0|) does not hold and the equality in Eq. Q can- 
not be satisfied. Consequently, there are attractive inter- 
actions between non-native contacts even when xi = Ui 
holds exactly. Nevertheless, Porto et al. [H| have demon- 
strated that the knowledge of Ui alone is practically suffi- 
cient for reconstructing the native contact matrix of small 
single-domain proteins. Therefore, construction of effec- 
tive rank-1 E- matrices is of great interest [Tjf. Based on 
the Porto et al.'s result, it is tempting to postulate that 
the satisfaction of the lower bound by a rank-1 i?-matrix 
is sufficient for the native conformation to be the unique 
GMEC. At present, however, there is no clear connection 
between the present formulation (energy minimization) 
and the Porto et al.'s combinatorial algorithm. 

Another approximation is a kind of mean- field approxi- 
mations in which the matrix element £ij is replaced by its 
average over column (£;,) = S^Li &ij /N . Let us define 
e = ((f !.),•■• ,(£ Nm )) T and n = (rai,--- ,n N ) T where 
rii = SjLi ^ij i s the contact number of the i-th residue. 
Then, we have the following approximation and the lower 
bound: 



E(C,S) « l -e T n 



> 



(12) 
(13) 



where the equality in (|13p holds if and only if the column- 
averaged _E-matrix is anti-parallel to the contact number 
vector, that is, e = en for some e < 0. This lower bound 
condition is analogous to Eq. ([5]), and can be regarded 
as another kind of the Go potential for one-dimensional 
protein structure. It has been suggested that contact 
number vector can significantly constrain the conforma- 
tional space [III]. Together with other one-dimensional 
structures, contact number vector is also used for recov- 
ering the native structures |l9l|. and can be accurately 
predicted [13, ED, 0, HI HlP It has been pointed out 
that the contact number vector is highly correlated with 
the principal eigenvector of the C-matrix 16|, 1||, which 
suggests that this mean-field approximation is qualita- 
tively similar to the principal eigenvector approximation 
introduced above. 



DISCUSSION 

Using a more restricted, but conventional, form of the 
_E-matrix where each element £y depends only on the 
types of i-th and j-th residues (e.g., the MJ potential), 
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Vendruscolo et al. [25j, [26| have shown that it is im- 
possible for such i?-matrices to stabilize all the native 
structures in a database. The conventional ©-matrices 
such as those they studied do not take into account the 
sequence-dependence beyond a summation of the con- 
tributions from residue pairs. In the present study, we 
assumed a more general form for the ©-matrix, allow- 
ing each element £y- to depend on the whole amino acid 
sequence. In practical situations of protein structure pre- 
diction, we want to optimize an energy function so that 
the native conformations of arbitrary proteins achieve 
the lower bound. Now let us impose this as a requisite 
for the ©-matrix. Then, there should exist a function, 
namely £, that maps each amino acid sequence to the 
corresponding optimal ©-matrix, that is, the Go poten- 
tial. Thus, the problem of structure prediction becomes a 
trivial matter. Currently, most efforts for developing en- 
ergy functions seem to be focused on accurate estimation 
of a fixed set of parameters for a given functional form Q . 
The present analysis suggests that inferring the function 
£ that can generate the Go-like ©-matrices from amino 
acid sequences is essential if a contact potential is used. 
The lower bound inequality (Eq. [4]) and its condition for 
the equality (Eq. [5]) will serve as the guiding principle 
for inferring such a function. This approach to structure 
prediction is apparently similar to machine-learning ap- 
proaches to contact matrix prediction [17|, |27[ . Although 
conventional machine-learning methods are not directly 
targeted at the optimization of the form of Eq. ([4]), their 
prediction accuracy should be indicative of the possibility 
for identifying the function £ . 

In the preceding paragraph, we have assumed the exis- 
tence of the function £ to construct the optimal contact 
potential from a given amino acid sequence. What if, 
however, there is no such function? In fact, the limited 
success of current contact matrix prediction [28| strongly 
suggests that this is more likely the case. Such a case 
implies either that there are proteins for which the lower 
bound energy cannot be achieved, or that the total en- 
ergy cannot be sufficiently accurately approximated by 
Eq. ([I]). The former case indicates that some proteins 
are inherently frustrated, but to a good approximation 
such proteins should be rather exceptional for natural 
proteins [Hi [HI]. The latter case may indicate that multi- 
body contact interactions [2§| and/or other energy com- 
ponents than contact energies are more important. 

In summary, we have shown that the requirement for 
the native structure to achieve the lower bound naturally 
leads to the Go potential and the requirement for such a 
conformation to be the unique GMEC leads to the native 
conformation being the most compact one among those 
containing all the native contacts. These results suggest 
that protein structure prediction should be possible sim- 
ply by constructing the optimal energy matrices or that 
the contact potential alone is not suitable for the prob- 
lem. Although not yet definitive, the current state of 



contact prediction 28| as well as recent studies on local 
interactions [3(], [H | suggest that the latter may be the 
case. Nevertheless, the present results may be useful for 
evaluating the optimality of potential functions in either 
case. 
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